Background:
Necrotizing enterocolitis (NEC) is a gastrointestinal disease that is the leading cause of premature infant death in the NICU, with a mortality rate of 20–50%. NEC is difficult to diagnose due to late stage signs such as abdominal distension, blood in stool, microbial imbalance, and gas in intestinal walls (pneumatosis intestinalis) which quickly progresses to local and systemic inflammation, multi-organ failure, and death. Even if the premature infant survives NEC, they may lead a decreased quality of life due to permanent bowel issues and neurodevelopmental delays from NEC exposure. The NEC profile is characterized by up-regulation of pro-inflammatory markers such as IL-6, IL-8, IL-1B, and TNF alpha.
Hypothesis:
We hypothesize that the development and severity of NEC can be attenuated by through inhibition of pro-inflammation pathways.
Dataset Information
SRA: SRP272342
file type: fastq.gz
Summary: Comparison of intestinal epithelial cell tissue of breast fed (control) and formula fed (NEC model) mice at P4
Importing Libraries
library(tximport)
library(readr)
library(tximportData)
library(DESeq2) #nra-seq analysis
Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel':
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, clusterExport, clusterMap, parApply,
parCapply, parLapply, parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from 'package:stats':
IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':
anyDuplicated, append, as.data.frame, basename, cbind, colnames, dirname, do.call, duplicated, eval,
evalq, Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rownames, sapply,
setdiff, sort, table, tapply, union, unique, unsplit, which.max, which.min
Attaching package: 'S4Vectors'
The following objects are masked from 'package:base':
expand.grid, I, unname
Loading required package: IRanges
Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: MatrixGenerics
Loading required package: matrixStats
Attaching package: 'MatrixGenerics'
The following objects are masked from 'package:matrixStats':
colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse, colCounts, colCummaxs, colCummins,
colCumprods, colCumsums, colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs, colMads,
colMaxs, colMeans2, colMedians, colMins, colOrderStats, colProds, colQuantiles, colRanges, colRanks,
colSdDiffs, colSds, colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads, colWeightedMeans,
colWeightedMedians, colWeightedSds, colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods, rowCumsums, rowDiffs, rowIQRDiffs,
rowIQRs, rowLogSumExps, rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks, rowSdDiffs, rowSds, rowSums2,
rowTabulates, rowVarDiffs, rowVars, rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
rowWeightedSds, rowWeightedVars
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Attaching package: 'Biobase'
The following object is masked from 'package:MatrixGenerics':
rowMedians
The following objects are masked from 'package:matrixStats':
anyMissing, rowMedians
library(pheatmap)
library(vsn)
library(RColorBrewer)
library(gplots)
Attaching package: 'gplots'
The following object is masked from 'package:IRanges':
space
The following object is masked from 'package:S4Vectors':
space
The following object is masked from 'package:stats':
lowess
library(ggplot2)
library(genefilter)
Attaching package: 'genefilter'
The following objects are masked from 'package:MatrixGenerics':
rowSds, rowVars
The following objects are masked from 'package:matrixStats':
rowSds, rowVars
The following object is masked from 'package:readr':
spec
library(AnnotationDbi)
library(org.Mm.eg.db)
library(dplyr)
Attaching package: 'dplyr'
The following object is masked from 'package:AnnotationDbi':
select
The following object is masked from 'package:Biobase':
combine
The following object is masked from 'package:matrixStats':
count
The following objects are masked from 'package:GenomicRanges':
intersect, setdiff, union
The following object is masked from 'package:GenomeInfoDb':
intersect
The following objects are masked from 'package:IRanges':
collapse, desc, intersect, setdiff, slice, union
The following objects are masked from 'package:S4Vectors':
first, intersect, rename, setdiff, setequal, union
The following objects are masked from 'package:BiocGenerics':
combine, intersect, setdiff, union
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(DiagrammeR)
library(GenomicFeatures)
library(apeglm)
library(gage)
library(gageData)
library(pathview)
##############################################################################
Pathview is an open source software package distributed under GNU General
Public License version 3 (GPLv3). Details of GPLv3 is available at
http://www.gnu.org/licenses/gpl-3.0.html. Particullary, users are required to
formally cite the original Pathview paper (not just mention it) in publications
or products. For details, do citation("pathview") within R.
The pathview downloads and uses KEGG data. Non-academic uses may require a KEGG
license agreement (details at http://www.kegg.jp/kegg/legal.html).
##############################################################################
library(biomaRt)
library(clusterProfiler)
clusterProfiler v4.0.0 For help: https://guangchuangyu.github.io/software/clusterProfiler
If you use clusterProfiler in published research, please cite:
Guangchuang Yu, Li-Gen Wang, Yanyan Han, Qing-Yu He. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287.
Attaching package: 'clusterProfiler'
The following object is masked from 'package:biomaRt':
select
The following object is masked from 'package:AnnotationDbi':
select
The following object is masked from 'package:IRanges':
slice
The following object is masked from 'package:S4Vectors':
rename
The following object is masked from 'package:stats':
filter
Design
grViz("digraph flowchart {
# node definitions with substituted label text
node [fontname = Helvetica, shape = rectangle]
tab1 [label = '@@1']
tab2 [label = '@@2']
tab3 [label = '@@3']
tab4 [label = '@@4']
tab5 [label = '@@5']
# edge definitions with the node IDs
tab1 -> tab2 -> tab3 -> tab4 -> tab5;
}
[1]: 'Obtain data'
[2]: 'Process and clean raw data: fastq to readCounts'
[3]: 'Observe sample quality'
[4]: 'Data Visualization'
[5]: 'Pathway Analysis'
")
Obtain data
Downloading raw, zipped files from NCBI GEO repository
#process ()
#{
# fasterq-dump $SRR
# mv ${SRR}_1.fastq ${sample}_1.fastq; mv ${SRR}_2.fastq ${sample}_2.fastq
# gzip ${sample}_1.fastq ${sample}_2.fastq
#}
#for GSM in $(grep -i -v i3c GSMtable.txt | cut -f1)
#do
# SRR=$(grep $GSM runInfo.txt | cut -f1)
# sample=$(grep $GSM GSMtable.txt | cut -f2 | awk '{gsub(" ", ""); print}')
# #echo $sample
# process &
#done
#wait
#echo done
Setting up files for DESeq2 Analysis
# Setting working directory
dir <- "/home/aaltamirano/Documents/nsbb552/quants/Salmon"
list.files(dir)
[1] "0f10d83b1050c08dd53189986f60970b92a315aa7a16a6f1.gtf" "mm10.refGene.gtf"
[3] "sample.e144_1_bf" "sample.e144_14_ff"
[5] "sample.e144_17_ff" "sample.e144_19_ff"
[7] "sample.e145_15_ff" "sample.e145_2_bf"
[9] "sample.e145_3_bf" "sample.e145_4_bf"
[11] "samples.txt"
# Make a gene ID x Transcript name data frame from reference genome (tx2gene)
# Order of columns matters
txdb <- makeTxDbFromGFF("/home/aaltamirano/Documents/nsbb552/quants/Salmon/0f10d83b1050c08dd53189986f60970b92a315aa7a16a6f1.gtf")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... The "phase" metadata column contains non-NA values for features of type stop_codon. This information was
ignored.OK
k <- keys(txdb, keytype = "TXNAME")
tx2gene <- AnnotationDbi::select(txdb, keys = k, keytype = "TXNAME", columns = "GENEID")
'select()' returned 1:1 mapping between keys and columns
head(tx2gene)
#Make a samples.txt file with sample IDs - should match the sample IDs used for alignment/mapping
#Assign samples.txt to the samples variable
samples <- read.table(file.path(dir, "samples.txt"), header = TRUE)
samples$condition <- factor(rep(c("A","B"),each=4)) #Adds a condition column
samples
# A=control B=NEC
files <- file.path(dir, samples$sample, "quant.sf")
names(files) <- paste0(samples$sample)
#files
txi.salmon <- tximport(files, type = "salmon", tx2gene = tx2gene, ignoreTxVersion = TRUE)
reading in files with read_tsv
1 2 3 4 5 6 7 8
transcripts missing from tx2gene: 176
summarizing abundance
summarizing counts
summarizing length
#names(txi.salmon)
head(txi.salmon$counts)
sample.e145_2_bf sample.e145_3_bf sample.e145_4_bf sample.e144_1_bf sample.e144_14_ff
ENSMUSG00000000001 592 446 808 816.000 1076.000
ENSMUSG00000000003 0 0 0 0.000 0.000
ENSMUSG00000000028 41 25 72 31.000 55.000
ENSMUSG00000000037 10 9 15 21.001 24.000
ENSMUSG00000000049 2 1 3 5.000 4.000
ENSMUSG00000000056 239 157 345 253.000 292.089
sample.e144_17_ff sample.e144_19_ff sample.e145_15_ff
ENSMUSG00000000001 771 1071.371 803
ENSMUSG00000000003 0 0.000 0
ENSMUSG00000000028 69 75.000 68
ENSMUSG00000000037 17 21.000 35
ENSMUSG00000000049 2 3.000 1
ENSMUSG00000000056 207 339.000 294
#all(file.exists(files))
#head(files)
#file.exists(files)
data.frame(thefiles = files, doihave = file.exists(files))
Loading deseq2
dds <- DESeqDataSetFromTximport(txi.salmon, samples, ~condition)
using counts and average transcript lengths from tximport
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]
dds$condition <- factor(dds$condition, levels = c("A","B"))
Processing data
dds <- DESeq(dds)
estimating size factors
using 'avgTxLength' from assays(dds), correcting for library size
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
dds
class: DESeqDataSet
dim: 18036 8
metadata(1): version
assays(6): counts avgTxLength ... H cooks
rownames(18036): ENSMUSG00000000001 ENSMUSG00000000028 ... ENSMUSG00000118474 ENSMUSG00000118486
rowData names(22): baseMean baseVar ... deviance maxCooks
colnames(8): sample.e145_2_bf sample.e145_3_bf ... sample.e144_19_ff sample.e145_15_ff
colData names(2): sample condition
res <- results(dds)
#res <- res[res$log2FoldChange >= 1, ]
summary(res)
out of 18036 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 824, 4.6%
LFC < 0 (down) : 1057, 5.9%
outliers [1] : 11, 0.061%
low counts [2] : 3144, 17%
(mean count < 8)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
Clean data
## log fold change shrinkage for visualization and ranking
resultsNames(dds)
[1] "Intercept" "condition_B_vs_A"
resLFC <- lfcShrink(dds, coef="condition_B_vs_A", type="apeglm")
using 'apeglm' for LFC shrinkage. If used in published research, please cite:
Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
sequence count data: removing the noise and preserving large differences.
Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
resLFC
log2 fold change (MAP): condition B vs A
Wald test p-value: condition B vs A
DataFrame with 18036 rows and 5 columns
baseMean log2FoldChange lfcSE pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric>
ENSMUSG00000000001 779.95385 0.05630551 0.138118 0.587677 0.801338
ENSMUSG00000000028 51.85717 0.11377348 0.187239 0.294300 0.574282
ENSMUSG00000000037 18.06297 0.05102431 0.192146 0.539494 0.773729
ENSMUSG00000000049 3.04633 -0.03246901 0.205843 0.389853 NA
ENSMUSG00000000056 262.93131 0.00273615 0.148849 0.980164 0.991664
... ... ... ... ... ...
ENSMUSG00000118434 17.96429 0.00680990 0.188574 0.930129 0.971692
ENSMUSG00000118458 31.01807 -0.00784403 0.187435 0.917789 0.967527
ENSMUSG00000118462 1.56485 -0.02126945 0.205485 0.457317 NA
ENSMUSG00000118474 2.35901 0.01396206 0.203058 0.710999 NA
ENSMUSG00000118486 1.28008 0.01290908 0.204958 0.599901 NA
## Organizing results by smallest P-value
resOrdered <- res[order(res$pvalue),]
resOrdered
log2 fold change (MLE): condition B vs A
Wald test p-value: condition B vs A
DataFrame with 18036 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSMUSG00000030017 310.8436 4.59658 0.259441 17.71726 3.08587e-70 4.59209e-66
ENSMUSG00000071356 1057.5221 2.79887 0.248813 11.24886 2.34592e-29 1.74548e-25
ENSMUSG00000069917 1299.4493 -2.90978 0.271248 -10.72741 7.56910e-27 3.75452e-23
ENSMUSG00000073940 99.9417 -3.66798 0.364057 -10.07528 7.10605e-24 2.64363e-20
ENSMUSG00000037033 552.1574 1.77785 0.178946 9.93511 2.92839e-23 8.71547e-20
... ... ... ... ... ... ...
ENSMUSG00000084319 53.70694 -1.19994 3.22936 -0.371572 NA NA
ENSMUSG00000084383 16.35271 -22.18544 3.38615 -6.551823 NA NA
ENSMUSG00000094248 5.23463 5.62342 3.38978 1.658933 NA NA
ENSMUSG00000096592 7.38043 -6.53535 3.38952 -1.928109 NA NA
ENSMUSG00000114942 11.80172 20.08815 3.38622 5.932327 NA NA
## How many adjusted p-values were less than 0.01?
sum(res$padj < 0.1, na.rm=TRUE)
[1] 1881
## Changing p-value to 0.05
res05 <- results(dds, alpha=0.05)
summary(res05)
out of 18036 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up) : 515, 2.9%
LFC < 0 (down) : 706, 3.9%
outliers [1] : 11, 0.061%
low counts [2] : 3494, 19%
(mean count < 10)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
## How many adjusted p-values were less htan 0.05?
sum(res05$padj < 0.05, na.rm=TRUE)
[1] 1221
Observe quality of data
plotMA(res, ylim=c(-2,2))

plotMA(resLFC, ylim=c(-2,2))

#checked
plotDispEsts(dds, main="Dispersion plot")

mcols(resLFC)$description
[1] "mean of normalized counts for all samples" "log2 fold change (MAP): condition B vs A"
[3] "posterior SD: condition B vs A" "Wald test p-value: condition B vs A"
[5] "BH adjusted p-values"
ntd <- normTransform(dds)
rld <- rlogTransformation(dds)
head(assay(rld))
sample.e145_2_bf sample.e145_3_bf sample.e145_4_bf sample.e144_1_bf sample.e144_14_ff
ENSMUSG00000000001 9.586836 9.432189 9.459377 9.772670 9.740088
ENSMUSG00000000028 5.699223 5.549135 5.750399 5.518271 5.665467
ENSMUSG00000000037 4.113784 4.070128 4.068480 4.285997 4.162325
ENSMUSG00000000049 1.487834 1.424627 1.442311 1.594476 1.463742
ENSMUSG00000000056 8.274156 7.880116 7.883624 8.022380 8.202945
ENSMUSG00000000058 6.058225 6.075010 6.458566 6.145500 6.529041
sample.e144_17_ff sample.e144_19_ff sample.e145_15_ff
ENSMUSG00000000001 9.678711 9.676637 9.394747
ENSMUSG00000000028 5.884028 5.717651 5.699630
ENSMUSG00000000037 4.180546 4.124824 4.298698
ENSMUSG00000000049 1.478057 1.441119 1.402578
ENSMUSG00000000056 7.840829 8.116099 7.916973
ENSMUSG00000000058 5.953610 6.106999 6.163559
vsd <- vst(dds, blind = FALSE)
head(assay(vsd), 3)
sample.e145_2_bf sample.e145_3_bf sample.e145_4_bf sample.e144_1_bf sample.e144_14_ff
ENSMUSG00000000001 9.717893 9.482189 9.525596 9.995258 9.946287
ENSMUSG00000000028 6.871501 6.655288 6.936889 6.619648 6.826289
ENSMUSG00000000037 6.148628 6.074012 6.087192 6.415225 6.230451
sample.e144_17_ff sample.e144_19_ff sample.e145_15_ff
ENSMUSG00000000001 9.855570 9.852053 9.427373
ENSMUSG00000000028 7.116053 6.894127 6.871023
ENSMUSG00000000037 6.258165 6.175734 6.419137
meanSdPlot(assay(ntd))

meanSdPlot(assay(vsd))

meanSdPlot(assay(rld))

#Shows the effect of the transformation, i
ddsESF <- estimateSizeFactors(dds)
using 'avgTxLength' from assays(dds), correcting for library size
df1 <- data.frame(log2(counts(ddsESF, normalized=TRUE)[, 1:2] + 1))
df1$transformation <- "log2(x + 1)"
df2 <- data.frame(assay(rld)[, 1:2])
df2$transformation <- "rld"
df3 <- data.frame(assay(vsd)[, 1:2])
df3$transformation <- "vsd"
df <- rbind(df1, df2, df3)
colnames(df)[1:2] <- c("x", "y")
head(df)
table(df$transformation)
log2(x + 1) rld vsd
18036 18036 18036
options(repr.plot.width=6, repr.plot.height=2.5)
ggplot(df, aes(x = x, y = y)) + geom_hex(bins = 80) +
coord_fixed() + facet_grid( . ~ transformation)
#rld graph compresses differences for the low count genes > excludes genes with low counts
par(mfrow=c(1,3))

boxplot(log2(assay(ddsESF)+1), las=2, main="log2(x+1)")
boxplot(assay(rld), las=2, main="rld")
boxplot(assay(vsd), las=2, main="vsd")

select <- order(rowMeans(counts(dds,normalized=TRUE)),
decreasing=TRUE)[1:20]
df <- as.data.frame(colData(dds)["condition"])
pheatmap(assay(ntd)[select,], cluster_rows=FALSE, show_rownames=FALSE,
cluster_cols=FALSE, annotation_col=df)
Found more than one class "unit" in cache; using the first, from namespace 'ggbio'
Also defined by 'hexbin'
Found more than one class "simpleUnit" in cache; using the first, from namespace 'ggbio'
Also defined by 'hexbin'
Found more than one class "simpleUnit" in cache; using the first, from namespace 'ggbio'
Also defined by 'hexbin'

#
(mycols <- brewer.pal(8, "Dark2")[1:length(unique(samples$condition))])
[1] "#1B9E77" "#D95F02"
sampleDists <- as.matrix(dist(t(assay(rld))))
heatmap.2(as.matrix(sampleDists), key=F, trace="none",
col=colorpanel(100, "blue", "white"),
ColSideColors=mycols[samples$condition],
RowSideColors=mycols[samples$condition],
margin=c(10, 10), main="Sample Distance Matrix")

sampleDists <- dist(t(assay(rld)))
sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(rld$condition, rld$sample, sep="-")
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
clustering_distance_rows=sampleDists,
clustering_distance_cols=sampleDists,
col=colors)

#PCA Alternative code
#DESeq2::plotPCA(rld, intgroup="condition")
pcaData <- plotPCA(rld, intgroup="condition", returnData=TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color=condition, shape=condition)) +
geom_point(size=3) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
coord_fixed()

table(resLFC$padj<0.05)
FALSE TRUE
13675 1206
res <- resLFC[order(resLFC$padj), ]
resdata <- merge(as.data.frame(res), as.data.frame(counts(dds, normalized=TRUE)), by="row.names", sort=FALSE)
names(resdata)[1] <- "Gene"
head(resdata)
Data Visualization
# Volcano plot
with(resLFC, plot(log2FoldChange, -log10(pvalue), pch=20, main="Volcano plot",ylim=c(0,10), xlim=c(-2,2)))
# Add colored points: red if padj<0.05, orange of log2FC>1, green if both)
with(subset(res, padj<.05 ), points(log2FoldChange, -log10(pvalue), pch=20, col="red"))
with(subset(res, abs(log2FoldChange)>1), points(log2FoldChange, -log10(pvalue), pch=20, col="orange"))
with(subset(res, padj<.05 & abs(log2FoldChange)>1), points(log2FoldChange, -log10(pvalue), pch=20, col="green"))

#Assigning the top 20 significant genes
topVargenes20 <- head(order(rowVars(assay(rld)), decreasing = TRUE), 20)
topVargenes500 <- head(order(rowVars(assay(rld)), decreasing = TRUE), 500)
#Making a heatmap of the top 20 significant genes
mat <- assay(rld)[topVargenes20, ]
mat <- mat - rowMeans(mat)
anno <- as.data.frame(colData(rld)["condition"])
pheatmap(mat, annotation_col = anno)

mat <- assay(rld)[topVargenes500, ]
mat <- mat - rowMeans(mat)
anno <- as.data.frame(colData(rld)["condition"])
pheatmap(mat, scale = "row", show_rownames = F,
clustering_method = 'average', annotation_col = anno, cutree_cols = 3)

Pathway Analysis
#Creating datasets with KEGG gene sets to test
kegg_mouse <- kegg.gsets(species = "mouse", id.type = "kegg")
names(kegg_mouse)
[1] "kg.sets" "sigmet.idx" "sig.idx" "met.idx" "dise.idx"
kegg.gs <- kegg_mouse$kg.sets[kegg_mouse$sigmet.idx]
res$symbol = mapIds(org.Mm.eg.db,
keys=row.names(res),
column="SYMBOL",
keytype="ENSEMBL",
multiVals="first")
'select()' returned 1:many mapping between keys and columns
res$entrez = mapIds(org.Mm.eg.db,
keys=row.names(res),
column="ENTREZID",
keytype="ENSEMBL",
multiVals="first")
'select()' returned 1:many mapping between keys and columns
res$name = mapIds(org.Mm.eg.db,
keys=row.names(res),
column="GENENAME",
keytype="ENSEMBL",
multiVals="first")
'select()' returned 1:many mapping between keys and columns
foldchanges = res$log2FoldChange
names(foldchanges) = res$entrez
#Running GAGE
keggres = gage(foldchanges, gsets=kegg.gs, same.dir = T)
names(keggres)
[1] "greater" "less" "stats"
lapply(keggres, head) #Up-regulated and down-regulated pathways
$greater
p.geomean stat.mean p.val q.val set.size exp1
mmu03013 RNA transport 0.001703931 2.952683 0.001703931 0.2109217 160 0.001703931
mmu04610 Complement and coagulation cascades 0.001787472 2.969962 0.001787472 0.2109217 71 0.001787472
mmu03010 Ribosome 0.004307706 2.658194 0.004307706 0.3388729 134 0.004307706
mmu04621 NOD-like receptor signaling pathway 0.007573771 2.443348 0.007573771 0.4468525 172 0.007573771
mmu03040 Spliceosome 0.009753212 2.354265 0.009753212 0.4603516 128 0.009753212
mmu04380 Osteoclast differentiation 0.015603079 2.171236 0.015603079 0.6137211 116 0.015603079
$less
p.geomean stat.mean p.val q.val set.size exp1
mmu04142 Lysosome 5.775434e-06 -4.484937 5.775434e-06 0.001363002 121 5.775434e-06
mmu00100 Steroid biosynthesis 1.848401e-03 -3.148434 1.848401e-03 0.138864576 18 1.848401e-03
mmu01212 Fatty acid metabolism 2.241459e-03 -2.900056 2.241459e-03 0.138864576 59 2.241459e-03
mmu00071 Fatty acid degradation 2.353637e-03 -2.911957 2.353637e-03 0.138864576 42 2.353637e-03
mmu00531 Glycosaminoglycan degradation 4.268060e-03 -2.801355 4.268060e-03 0.162815564 18 4.268060e-03
mmu01240 Biosynthesis of cofactors 4.637380e-03 -2.619927 4.637380e-03 0.162815564 141 4.637380e-03
$stats
stat.mean exp1
mmu03013 RNA transport 2.952683 2.952683
mmu04610 Complement and coagulation cascades 2.969962 2.969962
mmu03010 Ribosome 2.658194 2.658194
mmu04621 NOD-like receptor signaling pathway 2.443348 2.443348
mmu03040 Spliceosome 2.354265 2.354265
mmu04380 Osteoclast differentiation 2.171236 2.171236
#head(keggres$greater) #Up-regulated pathways
#head(keggres$less) # Down-regulated pathways
# Explore the top 20 up-regulated pathways and KEGG IDs
keggrespathways = data.frame(id=rownames(keggres$greater), keggres$greater) %>%
tbl_df() %>%
filter(row_number()<=20) %>%
.$id %>%
as.character()
`tbl_df()` was deprecated in dplyr 1.0.0.
Please use `tibble::as_tibble()` instead.
keggrespathways
[1] "mmu03013 RNA transport" "mmu04610 Complement and coagulation cascades"
[3] "mmu03010 Ribosome" "mmu04621 NOD-like receptor signaling pathway"
[5] "mmu03040 Spliceosome" "mmu04380 Osteoclast differentiation"
[7] "mmu04924 Renin secretion" "mmu04662 B cell receptor signaling pathway"
[9] "mmu04657 IL-17 signaling pathway" "mmu03008 Ribosome biogenesis in eukaryotes"
[11] "mmu00970 Aminoacyl-tRNA biosynthesis" "mmu04670 Leukocyte transendothelial migration"
[13] "mmu04261 Adrenergic signaling in cardiomyocytes" "mmu04970 Salivary secretion"
[15] "mmu04972 Pancreatic secretion" "mmu04672 Intestinal immune network for IgA production"
[17] "mmu04640 Hematopoietic cell lineage" "mmu04915 Estrogen signaling pathway"
[19] "mmu04024 cAMP signaling pathway" "mmu04668 TNF signaling pathway"
# Get the IDs.
keggresids = substr(keggrespathways, start=1, stop=8)
keggresids
[1] "mmu03013" "mmu04610" "mmu03010" "mmu04621" "mmu03040" "mmu04380" "mmu04924" "mmu04662" "mmu04657" "mmu03008"
[11] "mmu00970" "mmu04670" "mmu04261" "mmu04970" "mmu04972" "mmu04672" "mmu04640" "mmu04915" "mmu04024" "mmu04668"
# Define plotting function for applying later
plot_pathway = function(pid) pathview(gene.data = foldchanges, pathway.id=pid, species="mouse", new.signature=FALSE)
#plot multiple pathways (plots saved to disk and returns a throwaway list object)
tmp = sapply(keggresids, function(pid) pathview(gene.data = foldchanges, pathway.id=pid, species="mouse"))
'select()' returned 1:1 mapping between keys and columns
Info: Working in directory /home/aaltamirano/Documents/nsbb552
Info: Writing image file mmu03013.pathview.png
# Open inflammation-associated pathway .png files
feh ~/Documents/nsbb552/mmu04662.pathview.png #"mmu04662 B cell receptor signaling pathway"
feh ~/Documents/nsbb552/mmu04657.pathview.png #"mmu04657 IL-17 signaling pathway"
feh ~/Documents/nsbb552/mmu04672.pathview.png #"mmu04672 Intestinal immune network for IgA production"
feh ~/Documents/nsbb552/mmu04668.pathview.png #"mmu04668 TNF signaling pathway"
---
title: "NSBB 552 Final Project"
output: html_notebook
---

## Background:

##### Necrotizing enterocolitis (NEC) is a gastrointestinal disease that is the leading cause of premature infant death in the NICU, with a mortality rate of 20--50%. NEC is difficult to diagnose due to late stage signs such as abdominal distension, blood in stool, microbial imbalance, and gas in intestinal walls (pneumatosis intestinalis) which quickly progresses to local and systemic inflammation, multi-organ failure, and death. Even if the premature infant survives NEC, they may lead a decreased quality of life due to permanent bowel issues and neurodevelopmental delays from NEC exposure. The NEC profile is characterized by up-regulation of pro-inflammatory markers such as IL-6, IL-8, IL-1B, and TNF alpha.

## Hypothesis:

##### We hypothesize that the development and severity of NEC can be attenuated by through inhibition of pro-inflammation pathways.

### Dataset Information

##### link: <https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE154617>

##### SRA: SRP272342

##### file type: fastq.gz

##### Summary: Comparison of intestinal epithelial cell tissue of breast fed (control) and formula fed (NEC model) mice at P4

##### Importing Libraries

```{r}
library(tximport)
library(readr)
library(tximportData)
library(DESeq2) #nra-seq analysis 
library(pheatmap)
library(vsn)
library(RColorBrewer)
library(gplots)
library(ggplot2)
library(genefilter)
library(AnnotationDbi)
library(org.Mm.eg.db)
library(dplyr)
library(DiagrammeR)
library(GenomicFeatures)
library(apeglm)
library(gage)
library(gageData)
library(pathview)
library(biomaRt)
library(clusterProfiler)
```

### Design

```{r}
grViz("digraph flowchart {
      # node definitions with substituted label text
      node [fontname = Helvetica, shape = rectangle]        
      tab1 [label = '@@1']
      tab2 [label = '@@2']
      tab3 [label = '@@3']
      tab4 [label = '@@4']
      tab5 [label = '@@5']

      # edge definitions with the node IDs
      tab1 -> tab2 -> tab3 -> tab4 -> tab5;
      }

      [1]: 'Obtain data'
      [2]: 'Process and clean raw data: fastq to readCounts'
      [3]: 'Observe sample quality'
      [4]: 'Data Visualization'
      [5]: 'Pathway Analysis' 
      ")
```

# **Obtain data**

### Downloading raw, zipped files from NCBI GEO repository

```{bash}
#process ()
#{
#        fasterq-dump $SRR
#        mv ${SRR}_1.fastq ${sample}_1.fastq; mv ${SRR}_2.fastq ${sample}_2.fastq
#        gzip ${sample}_1.fastq  ${sample}_2.fastq
#}

#for GSM in $(grep -i -v i3c GSMtable.txt | cut -f1)
#do
#        SRR=$(grep $GSM runInfo.txt | cut -f1)
#        sample=$(grep $GSM GSMtable.txt | cut -f2 | awk '{gsub(" ", ""); print}')
#        #echo $sample
#        process &
#done
#wait
#echo done 

```

### Importing raw files into alignment tool (Salmon)

```{bash}
#for fn in sample.e144_14_ff sample.e144_17_ff sample.e144_19_ff sample.e144_1_bf sam>
#do  
#echo "Processing sample $fn"
#salmon quant -i /home/aaltamirano/Documents/nsbb552/genome_folder/alias/mm10/salmon_>
#        -1 ${fn}_1.fastq \
#        -2 ${fn}_2.fastq \
#        -p 24 --validateMappings -o quants/Salmon/${fn}
#done 
#echo "done" 
```

### Setting up files for DESeq2 Analysis

```{r}
# Setting working directory
dir <- "/home/aaltamirano/Documents/nsbb552/quants/Salmon"
list.files(dir)
```

```{r}
# Make a gene ID x Transcript name data frame from reference genome (tx2gene)
# Order of columns matters 
txdb <- makeTxDbFromGFF("/home/aaltamirano/Documents/nsbb552/quants/Salmon/0f10d83b1050c08dd53189986f60970b92a315aa7a16a6f1.gtf")
k <- keys(txdb, keytype = "TXNAME")
tx2gene <- AnnotationDbi::select(txdb, keys = k, keytype = "TXNAME", columns = "GENEID")
head(tx2gene)
```

```{r}
#Make a samples.txt file with sample IDs - should match the sample IDs used for alignment/mapping
#Assign samples.txt to the samples variable
samples <- read.table(file.path(dir, "samples.txt"), header = TRUE)
samples$condition <- factor(rep(c("A","B"),each=4)) #Adds a condition column
samples
# A=control B=NEC


files <- file.path(dir, samples$sample, "quant.sf")
names(files) <- paste0(samples$sample)
#files
txi.salmon <- tximport(files, type = "salmon", tx2gene = tx2gene, ignoreTxVersion = TRUE)
#names(txi.salmon)
head(txi.salmon$counts)

```

```{r}
#all(file.exists(files))
#head(files)
#file.exists(files)
data.frame(thefiles = files, doihave = file.exists(files))
```

### Loading deseq2

```{r}
dds <- DESeqDataSetFromTximport(txi.salmon, samples, ~condition)
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]
```

```{r}
dds$condition <- factor(dds$condition, levels = c("A","B"))
```

# **Processing data**

```{r}
dds <- DESeq(dds)
dds
res <- results(dds)
#res <- res[res$log2FoldChange >= 1, ]
summary(res)
```

# **Clean data**

```{r}
## log fold change shrinkage for visualization and ranking
resultsNames(dds)
resLFC <- lfcShrink(dds, coef="condition_B_vs_A", type="apeglm")
resLFC
```

```{r}
## Organizing results by smallest P-value
resOrdered <- res[order(res$pvalue),]
resOrdered

## How many adjusted p-values were less than 0.01?
sum(res$padj < 0.1, na.rm=TRUE)

## Changing p-value to 0.05 
res05 <- results(dds, alpha=0.05)
summary(res05)
## How many adjusted p-values were less htan 0.05?
sum(res05$padj < 0.05, na.rm=TRUE)
```

# **Observe quality of data**

```{r}
plotMA(res, ylim=c(-2,2))
plotMA(resLFC, ylim=c(-2,2))
```

```{r}
#checked 
plotDispEsts(dds, main="Dispersion plot")
```

```{r}
mcols(resLFC)$description
```

```{r}
ntd <- normTransform(dds)
rld <- rlogTransformation(dds)
head(assay(rld))

vsd <- vst(dds, blind = FALSE)
head(assay(vsd), 3)

meanSdPlot(assay(ntd))
meanSdPlot(assay(vsd))
meanSdPlot(assay(rld))

#Shows the effect of the transformation, i
ddsESF <- estimateSizeFactors(dds)
df1 <- data.frame(log2(counts(ddsESF, normalized=TRUE)[, 1:2] + 1))
df1$transformation <- "log2(x + 1)"
df2 <- data.frame(assay(rld)[, 1:2])
df2$transformation <- "rld"
df3 <- data.frame(assay(vsd)[, 1:2])
df3$transformation <- "vsd"
df <- rbind(df1, df2, df3)
colnames(df)[1:2] <- c("x", "y")
head(df)
table(df$transformation)

options(repr.plot.width=6, repr.plot.height=2.5)
ggplot(df, aes(x = x, y = y)) + geom_hex(bins = 80) +
coord_fixed() + facet_grid( . ~ transformation)
#rld graph compresses differences for the low count genes > excludes genes with low counts

par(mfrow=c(1,3))
boxplot(log2(assay(ddsESF)+1), las=2, main="log2(x+1)")
boxplot(assay(rld), las=2, main="rld")
boxplot(assay(vsd), las=2, main="vsd")
```

```{r}

select <- order(rowMeans(counts(dds,normalized=TRUE)),
                decreasing=TRUE)[1:20]
df <- as.data.frame(colData(dds)["condition"])
pheatmap(assay(ntd)[select,], cluster_rows=FALSE, show_rownames=FALSE,
         cluster_cols=FALSE, annotation_col=df)
```

```{r}
#
(mycols <- brewer.pal(8, "Dark2")[1:length(unique(samples$condition))])
sampleDists <- as.matrix(dist(t(assay(rld))))
heatmap.2(as.matrix(sampleDists), key=F, trace="none",
          col=colorpanel(100, "blue", "white"),
          ColSideColors=mycols[samples$condition],
          RowSideColors=mycols[samples$condition],
          margin=c(10, 10), main="Sample Distance Matrix")
```

```{r}
sampleDists <- dist(t(assay(rld)))

sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(rld$condition, rld$sample, sep="-")
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
         clustering_distance_rows=sampleDists,
         clustering_distance_cols=sampleDists,
         col=colors)
```

```{r}
#PCA Alternative code
#DESeq2::plotPCA(rld, intgroup="condition")
```

```{r}
pcaData <- plotPCA(rld, intgroup="condition", returnData=TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color=condition, shape=condition)) +
  geom_point(size=3) +
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) + 
  coord_fixed()
```

```{r}
table(resLFC$padj<0.05)
res <- resLFC[order(resLFC$padj), ]
resdata <- merge(as.data.frame(res), as.data.frame(counts(dds, normalized=TRUE)), by="row.names", sort=FALSE)
names(resdata)[1] <- "Gene"
head(resdata)
```

# **Data Visualization**

```{r}
# Volcano plot
with(resLFC, plot(log2FoldChange, -log10(pvalue), pch=20, main="Volcano plot",ylim=c(0,10), xlim=c(-2,2)))
# Add colored points: red if padj<0.05, orange of log2FC>1, green if both)
with(subset(res, padj<.05 ), points(log2FoldChange, -log10(pvalue), pch=20, col="red"))
with(subset(res, abs(log2FoldChange)>1), points(log2FoldChange, -log10(pvalue), pch=20, col="orange"))
with(subset(res, padj<.05 & abs(log2FoldChange)>1), points(log2FoldChange, -log10(pvalue), pch=20, col="green"))
```

```{r}
#Assigning the top 20 significant genes 
topVargenes20 <- head(order(rowVars(assay(rld)), decreasing = TRUE), 20)
topVargenes500 <- head(order(rowVars(assay(rld)), decreasing = TRUE), 500)


#Making a heatmap of the top 20 significant genes 
mat <- assay(rld)[topVargenes20, ]
mat <- mat - rowMeans(mat)
anno <- as.data.frame(colData(rld)["condition"])
pheatmap(mat, annotation_col = anno)


mat <- assay(rld)[topVargenes500, ]
mat <- mat - rowMeans(mat)
anno <- as.data.frame(colData(rld)["condition"])
pheatmap(mat, scale = "row", show_rownames = F,
         clustering_method = 'average',  annotation_col = anno, cutree_cols = 3)
```

# **Pathway Analysis**

```{r}
#Creating datasets with KEGG gene sets to test
kegg_mouse <- kegg.gsets(species = "mouse", id.type = "kegg")
names(kegg_mouse)
kegg.gs <- kegg_mouse$kg.sets[kegg_mouse$sigmet.idx]
```

```{r}
res$symbol = mapIds(org.Mm.eg.db,
                     keys=row.names(res), 
                     column="SYMBOL",
                     keytype="ENSEMBL",
                     multiVals="first")
res$entrez = mapIds(org.Mm.eg.db,
                     keys=row.names(res), 
                     column="ENTREZID",
                     keytype="ENSEMBL",
                     multiVals="first")
res$name =   mapIds(org.Mm.eg.db,
                     keys=row.names(res), 
                     column="GENENAME",
                     keytype="ENSEMBL",
                     multiVals="first")
foldchanges = res$log2FoldChange
names(foldchanges) = res$entrez
```

```{r}
#Running GAGE

keggres = gage(foldchanges, gsets=kegg.gs, same.dir = T)
names(keggres)

lapply(keggres, head) #Up-regulated and down-regulated pathways
#head(keggres$greater) #Up-regulated pathways
#head(keggres$less) # Down-regulated pathways

# Explore the top 20 up-regulated pathways and KEGG IDs
keggrespathways = data.frame(id=rownames(keggres$greater), keggres$greater) %>% 
  tbl_df() %>% 
  filter(row_number()<=20) %>% 
  .$id %>% 
  as.character()
keggrespathways
# Get the IDs.
keggresids = substr(keggrespathways, start=1, stop=8)
keggresids
```

```{r}
# Define plotting function for applying later
plot_pathway = function(pid) pathview(gene.data = foldchanges, pathway.id=pid, species="mouse", new.signature=FALSE)

#plot multiple pathways (plots saved to disk and returns a throwaway list object)
tmp = sapply(keggresids, function(pid) pathview(gene.data = foldchanges, pathway.id=pid, species="mouse"))
```

```{bash}
# Open inflammation-associated pathway .png files
feh ~/Documents/nsbb552/mmu04662.pathview.png #"mmu04662 B cell receptor signaling pathway"
feh ~/Documents/nsbb552/mmu04657.pathview.png #"mmu04657 IL-17 signaling pathway"
feh ~/Documents/nsbb552/mmu04672.pathview.png #"mmu04672 Intestinal immune network for IgA production"
feh ~/Documents/nsbb552/mmu04668.pathview.png #"mmu04668 TNF signaling pathway"  

```

# **Discussion**

##### After processing and cleaning the dataset, there is some clear differential expression between the breast fed (Condition A: Control) and the formula fed (Condition B: NEC) groups. This difference is most clearly observed when comparing samples 2,3,4 and 14,15,17, and 19. The top 20 most differentiated genes didn't have a strong association with inflammation pathways. Aditionally, downstream pathway analyses of p vale\<0.05 were to stringent, so p-value\<0.1 was used. In conlcusion, this dataset does show differntial expression of genes in pro-inflammation pathways. Further pathways analyses with IPA and possible gene enrichment would be useful in pursuing these connections.

```{r}
## Doesn't make biological sense 
# Create background dataset for hypergeometric testing using all genes tested for significance in the results                  
all_genes <- as.character(rownames(res)) 
# Extract significant results 
signif_res <- res[res$padj < 0.07 & !is.na(res$padj), ] 
signif_genes <- as.character(rownames(signif_res)) 

ego <- enrichGO(gene = signif_genes, universe = all_genes, keyType = "ENSEMBL", OrgDb = org.Mm.eg.db, ont = "BP",pAdjustMethod = "BH", qvalueCutoff = 0.07, readable = TRUE) 
# Output results from GO analysis to a table 
cluster_summary <- data.frame(ego) 
#Visualizing 
dotplot(ego, showCategory=50)
#emapplot(ego, showCategory=50)

# To color genes by log2 fold changes 
signif_res_lFC <- signif_res$log2FoldChange 
cnetplot(ego, categorySize="pvalue", showCategory = 5, foldChange= signif_res_lFC, vertex.label.font=6)
```

```{r}
sessionInfo()
```

### Relevant Documentation

1.  rnaseqGene: end-to-end gene-level RNA-seq differential expression workflow using Bioconductor packages (<http://master.bioconductor.org/packages/release/workflows/html/rnaseqGene.html>)

```{r}
browseVignettes("rnaseqGene")
```

2.  DESeq2: Differential gene expression analysis based on the negative binomial distribution (<https://bioconductor.org/packages/release/bioc/html/DESeq2.html>)

```{r}
browseVignettes("DESeq2")
```

3.  Salmon: A tool for quantifying the expression of transcripts using RNA-seq data (<https://combine-lab.github.io/salmon/getting_started/#after-quantification>)

-   

    ###### Patro, R., Duggal, G., Love, M. I., Irizarry, R. A., & Kingsford, C. (2017). Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods.

4.  apeglm

-   

    ###### Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for

    sequence count data: removing the noise and preserving large differences. Bioinformatics. <https://doi.org/10.1093/bioinformatics/bty895>
